Welcome to the CCLHDME R Workshop!

Agenda

Workshop Setup

R Studio IDE

Object Oriented Programming

my_number <- 8
my_number
## [1] 8
my_vector <- c(1,1,5,10,50)
my_vector
## [1]  1  1  5 10 50
my_character <- "awesome"
my_character
## [1] "awesome"
my_boolean <- TRUE
my_boolean
## [1] TRUE
my_dataframe <- cbind(my_vector,my_vector^2)
my_dataframe
##      my_vector     
## [1,]         1    1
## [2,]         1    1
## [3,]         5   25
## [4,]        10  100
## [5,]        50 2500

Classes

class(my_number)
## [1] "numeric"
class(pi)
## [1] "numeric"
class(my_boolean)
## [1] "logical"
class(FALSE)
## [1] "logical"
class("This object is made of characters")
## [1] "character"
class(as.Date("2015-07-27"))  
## [1] "Date"

A note on dates in R

  • R stores dates as the number of days since the epoch (1 Jan 1970)
  • Regular arithmetic operations can be conducted on dates
  • The lubridate package (part of tidyverse) has many date-related functions that make working with dates much easier
  • The function as.Date() converts character representations of a date to an object of class “Date” representing calendar dates
class(as.Date("2015-07-27"))  
## [1] "Date"

Functions

mean(my_vector)
## [1] 13.4
max(my_vector)
## [1] 50
paste0(my_character,"...", my_character)
## [1] "awesome...awesome"
sum(1,2,3,3,4,5)
## [1] 18
mean(c(1,2,3,4,5))
## [1] 3

making your own functions is a more advanced topic, but ultimately is essential for using R beyound the basics

myHypotenuse <- function(side1,side2) {sqrt(side1^2 + side2^2)}

myHypotenuse(10,20)
## [1] 22.36068
myHypotenuse(20,35)
## [1] 40.31129

Operators

20/5
## [1] 4
my_number*my_number
## [1] 64
2 > 3
## [1] FALSE
5^2 == 25
## [1] TRUE
"joe" %in% c("bill","sue","joe")
## [1] TRUE

Directories

getwd()
## [1] "C:/Users/rfisher/Desktop/CCLHDME_R_workshop"
#setwd("I:/CCLHDME/R proposal/CCLHDME_R_workshop")

Testing data types

is.character(my_number)
## [1] FALSE
is.numeric(my_number)
## [1] TRUE

Changing data types

my_number_as_character <- as.character(my_number)
my_character_as_number <- as.numeric(my_character)
print(my_number_as_character)
## [1] "8"
print(my_character_as_number)
## [1] NA

-trying to coerce types can lead to weird behavior

Data Structures

Packages in R

Get Documentation on R packages

help(package = "tidyverse")

Installing packages

#install.packages("tidyverse") 

Loading or Calling packages in R session

library(tidyverse)

Tidyverse

What is Tidy data?

On to working with data!

Download CSV File

Import data from the CSV

ca_id_data <- read_csv("./data/infectious-disease-cases-by-county-year-and-sex-2-27-19.csv")

Importing from other file types

Import data directly from the Open Data Portal using the API (Application Program Interface)

Now we pull the data from the API

ca_id_data_api <- read_csv("https://data.chhs.ca.gov/dataset/391d75bf-00fc-4fd7-b00d-9bd16bbe01c0/resource/741f25e1-db50-436a-a6a9-7b840176edbf/download/infectious-disease-cases-by-county-year-and-sex-2-27-19.csv")
str(ca_id_data_api)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 132927 obs. of  10 variables:
##  $ Disease   : chr  "Amebiasis" "Amebiasis" "Amebiasis" "Amebiasis" ...
##  $ County    : chr  "Solano" "Marin" "Kern" "Tulare" ...
##  $ Year      : num  2005 2005 2010 2001 2014 ...
##  $ Sex       : chr  "Total" "Male" "Female" "Female" ...
##  $ Count     : num  0 4 1 1 17 2 0 0 3 0 ...
##  $ Population: num  410570 121710 404863 186935 5127242 ...
##  $ Rate      : num  0 3.287 0.247 0.535 0.332 ...
##  $ CI. lower : num  0 0.895 0.006 0.014 0.193 ...
##  $ CI.upper  : num  0.898 8.415 1.376 2.98 0.531 ...
##  $ Unstable  : chr  "-" "*" "*" "*" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Disease = col_character(),
##   ..   County = col_character(),
##   ..   Year = col_double(),
##   ..   Sex = col_character(),
##   ..   Count = col_double(),
##   ..   Population = col_double(),
##   ..   Rate = col_double(),
##   ..   `CI. lower` = col_double(),
##   ..   CI.upper = col_double(),
##   ..   Unstable = col_character()
##   .. )

A brief note on “Factors” in R (advanced topic)

  • Factors are variables that generally “look like” character string variables, but are actually stored internally as integers, with the character strings being just the ‘labels’
  • Factors can be very useful, but for many basic “epi/data management” purposes (except multivariate regression), they can be easier to use if they are converted into “regular” character variables (e.g. my_string_variable <- as.character(my_factor_variable))
  • when reading in .csv data using “read.csv()”, all character sting variables are read in by default as factors. To avoid this, you can use the “as.is=TRUE” argument to the read.csv function, or you can use read_csv (from the tidyverse readr() package)
  • factor() function creates factors, which can be useful in a number of situations
ca_id_data_api_FACTOR <- read.csv("https://data.chhs.ca.gov/dataset/391d75bf-00fc-4fd7-b00d-9bd16bbe01c0/resource/741f25e1-db50-436a-a6a9-7b840176edbf/download/infectious-disease-cases-by-county-year-and-sex-2-27-19.csv")
str(ca_id_data_api)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 132927 obs. of  10 variables:
##  $ Disease   : chr  "Amebiasis" "Amebiasis" "Amebiasis" "Amebiasis" ...
##  $ County    : chr  "Solano" "Marin" "Kern" "Tulare" ...
##  $ Year      : num  2005 2005 2010 2001 2014 ...
##  $ Sex       : chr  "Total" "Male" "Female" "Female" ...
##  $ Count     : num  0 4 1 1 17 2 0 0 3 0 ...
##  $ Population: num  410570 121710 404863 186935 5127242 ...
##  $ Rate      : num  0 3.287 0.247 0.535 0.332 ...
##  $ CI. lower : num  0 0.895 0.006 0.014 0.193 ...
##  $ CI.upper  : num  0.898 8.415 1.376 2.98 0.531 ...
##  $ Unstable  : chr  "-" "*" "*" "*" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Disease = col_character(),
##   ..   County = col_character(),
##   ..   Year = col_double(),
##   ..   Sex = col_character(),
##   ..   Count = col_double(),
##   ..   Population = col_double(),
##   ..   Rate = col_double(),
##   ..   `CI. lower` = col_double(),
##   ..   CI.upper = col_double(),
##   ..   Unstable = col_character()
##   .. )
  • factor() function creates factors
class(factor(c('Salmonellosis', 'Campylobacteriosis', 'Amebiasis'))) 
## [1] "factor"

Exploring our data

View(ca_id_data_api)

Explore data using str()

str(ca_id_data_api)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 132927 obs. of  10 variables:
##  $ Disease   : chr  "Amebiasis" "Amebiasis" "Amebiasis" "Amebiasis" ...
##  $ County    : chr  "Solano" "Marin" "Kern" "Tulare" ...
##  $ Year      : num  2005 2005 2010 2001 2014 ...
##  $ Sex       : chr  "Total" "Male" "Female" "Female" ...
##  $ Count     : num  0 4 1 1 17 2 0 0 3 0 ...
##  $ Population: num  410570 121710 404863 186935 5127242 ...
##  $ Rate      : num  0 3.287 0.247 0.535 0.332 ...
##  $ CI. lower : num  0 0.895 0.006 0.014 0.193 ...
##  $ CI.upper  : num  0.898 8.415 1.376 2.98 0.531 ...
##  $ Unstable  : chr  "-" "*" "*" "*" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Disease = col_character(),
##   ..   County = col_character(),
##   ..   Year = col_double(),
##   ..   Sex = col_character(),
##   ..   Count = col_double(),
##   ..   Population = col_double(),
##   ..   Rate = col_double(),
##   ..   `CI. lower` = col_double(),
##   ..   CI.upper = col_double(),
##   ..   Unstable = col_character()
##   .. )

Explore data using print()

print(ca_id_data)
## # A tibble: 132,927 x 10
##    Disease County  Year Sex   Count Population  Rate `CI. lower` CI.upper
##    <chr>   <chr>  <dbl> <chr> <dbl>      <dbl> <dbl>       <dbl>    <dbl>
##  1 Amebia~ Solano  2005 Total     0     410570 0           0        0.898
##  2 Amebia~ Marin   2005 Male      4     121710 3.29        0.895    8.41 
##  3 Amebia~ Kern    2010 Fema~     1     404863 0.247       0.006    1.38 
##  4 Amebia~ Tulare  2001 Fema~     1     186935 0.535       0.014    2.98 
##  5 Amebia~ Los A~  2014 Fema~    17    5127242 0.332       0.193    0.531
##  6 Amebia~ Napa    2006 Total     2     131920 1.52        0.184    5.48 
##  7 Amebia~ Marip~  2011 Total     0      18237 0           0       20.2  
##  8 Amebia~ Shasta  2006 Male      0      85947 0           0        4.29 
##  9 Amebia~ Sutter  2003 Male      3      41653 7.20        1.48    21.0  
## 10 Amebia~ Lake    2001 Male      0      29796 0           0       12.4  
## # ... with 132,917 more rows, and 1 more variable: Unstable <chr>

Explore data using head()

head(ca_id_data, 10)
## # A tibble: 10 x 10
##    Disease County  Year Sex   Count Population  Rate `CI. lower` CI.upper
##    <chr>   <chr>  <dbl> <chr> <dbl>      <dbl> <dbl>       <dbl>    <dbl>
##  1 Amebia~ Solano  2005 Total     0     410570 0           0        0.898
##  2 Amebia~ Marin   2005 Male      4     121710 3.29        0.895    8.41 
##  3 Amebia~ Kern    2010 Fema~     1     404863 0.247       0.006    1.38 
##  4 Amebia~ Tulare  2001 Fema~     1     186935 0.535       0.014    2.98 
##  5 Amebia~ Los A~  2014 Fema~    17    5127242 0.332       0.193    0.531
##  6 Amebia~ Napa    2006 Total     2     131920 1.52        0.184    5.48 
##  7 Amebia~ Marip~  2011 Total     0      18237 0           0       20.2  
##  8 Amebia~ Shasta  2006 Male      0      85947 0           0        4.29 
##  9 Amebia~ Sutter  2003 Male      3      41653 7.20        1.48    21.0  
## 10 Amebia~ Lake    2001 Male      0      29796 0           0       12.4  
## # ... with 1 more variable: Unstable <chr>

Explore data using tail()

tail(ca_id_data, 10)
## # A tibble: 10 x 10
##    Disease County  Year Sex   Count Population  Rate `CI. lower` CI.upper
##    <chr>   <chr>  <dbl> <chr> <dbl>      <dbl> <dbl>       <dbl>    <dbl>
##  1 Yersin~ Amador  2003 Fema~     0      16572     0           0   22.3  
##  2 Yersin~ Colusa  2012 Male      0      11200     0           0   32.9  
##  3 Yersin~ Kings   2008 Total     0     151834     0           0    2.43 
##  4 Yersin~ Marip~  2004 Fema~     0       8791     0           0   42.0  
##  5 Yersin~ Lake    2004 Fema~     0      31503     0           0   11.7  
##  6 Yersin~ Monte~  2012 Fema~     0     205765     0           0    1.79 
##  7 Yersin~ Imper~  2013 Male      0      92756     0           0    3.98 
##  8 Yersin~ Nevada  2013 Total     0      97727     0           0    3.78 
##  9 Yersin~ Humbo~  2011 Total     0     135332     0           0    2.73 
## 10 Yersin~ Sacra~  2007 Total     0    1388086     0           0    0.266
## # ... with 1 more variable: Unstable <chr>

Explore data using summary()

summary(ca_id_data)
##    Disease             County               Year          Sex           
##  Length:132927      Length:132927      Min.   :2001   Length:132927     
##  Class :character   Class :character   1st Qu.:2004   Class :character  
##  Mode  :character   Mode  :character   Median :2008   Mode  :character  
##                                        Mean   :2008                     
##                                        3rd Qu.:2012                     
##                                        Max.   :2015                     
##      Count            Population            Rate         
##  Min.   :    0.00   Min.   :     555   Min.   :   0.000  
##  1st Qu.:    0.00   1st Qu.:   28946   1st Qu.:   0.000  
##  Median :    0.00   Median :  124049   Median :   0.000  
##  Mean   :   29.09   Mean   :  833883   Mean   :   3.783  
##  3rd Qu.:    1.00   3rd Qu.:  413167   3rd Qu.:   0.053  
##  Max.   :49975.00   Max.   :39059809   Max.   :2989.961  
##    CI. lower           CI.upper          Unstable        
##  Min.   :   0.000   Min.   :   0.009   Length:132927     
##  1st Qu.:   0.000   1st Qu.:   1.429   Class :character  
##  Median :   0.000   Median :   4.768   Mode  :character  
##  Mean   :   2.793   Mean   :  25.308                     
##  3rd Qu.:   0.002   3rd Qu.:  17.422                     
##  Max.   :2773.224   Max.   :3218.730

Basic data management tasks using tidyr and dplyr

library(tidyr)
library(dplyr)

Intro to the pipe operator: %>%

Select Observations

ca_id_data_filtered <- ca_id_data %>% filter(Sex == "Total", Year == 2015, County == "California")
head(ca_id_data_filtered, 15)
## # A tibble: 15 x 10
##    Disease County  Year Sex   Count Population   Rate `CI. lower` CI.upper
##    <chr>   <chr>  <dbl> <chr> <dbl>      <dbl>  <dbl>       <dbl>    <dbl>
##  1 Amebia~ Calif~  2015 Total   309   39059809  0.791       0.705    0.884
##  2 Anapla~ Calif~  2015 Total     4   39059809  0.01        0.003    0.026
##  3 Anthrax Calif~  2015 Total     0   39059809  0           0        0.009
##  4 Babesi~ Calif~  2015 Total     7   39059809  0.018       0.007    0.037
##  5 Botuli~ Calif~  2015 Total     1   39059809  0.003       0        0.014
##  6 Botuli~ Calif~  2015 Total     2   39059809  0.005       0.001    0.018
##  7 Botuli~ Calif~  2015 Total    11   39059809  0.028       0.014    0.05 
##  8 Brucel~ Calif~  2015 Total    30   39059809  0.077       0.052    0.11 
##  9 Campyl~ Calif~  2015 Total  8272   39059809 21.2        20.7     21.6  
## 10 Cholera Calif~  2015 Total     0   39059809  0           0        0.009
## 11 Ciguat~ Calif~  2015 Total     1   39059809  0.003       0        0.014
## 12 Coccid~ Calif~  2015 Total  3060   39059809  7.83        7.56     8.12 
## 13 Creutz~ Calif~  2015 Total    18   39059809  0.046       0.027    0.073
## 14 Crypto~ Calif~  2015 Total   375   39059809  0.96        0.865    1.06 
## 15 Cyclos~ Calif~  2015 Total    15   39059809  0.038       0.021    0.063
## # ... with 1 more variable: Unstable <chr>

Select Variables

ca_id_data_selected <- ca_id_data_filtered %>% select(Disease, Year, Count, Population)
head(ca_id_data_selected, 15)
## # A tibble: 15 x 4
##    Disease                                            Year Count Population
##    <chr>                                             <dbl> <dbl>      <dbl>
##  1 Amebiasis                                          2015   309   39059809
##  2 Anaplasmosis and Ehrlichiosis                      2015     4   39059809
##  3 Anthrax                                            2015     0   39059809
##  4 Babesiosis                                         2015     7   39059809
##  5 Botulism, Foodborne                                2015     1   39059809
##  6 Botulism, Other                                    2015     2   39059809
##  7 Botulism, Wound                                    2015    11   39059809
##  8 Brucellosis                                        2015    30   39059809
##  9 Campylobacteriosis                                 2015  8272   39059809
## 10 Cholera                                            2015     0   39059809
## 11 Ciguatera Fish Poisoning                           2015     1   39059809
## 12 Coccidioidomycosis                                 2015  3060   39059809
## 13 Creutzfeldt-Jakob Disease and other Transmissibl~  2015    18   39059809
## 14 Cryptosporidiosis                                  2015   375   39059809
## 15 Cyclosporiasis                                     2015    15   39059809

Create new variables

ca_id_data_mutated <- ca_id_data_selected %>% mutate(incidence_rate = round(((Count/Population)*100000), digits = 2))
head(ca_id_data_mutated, 15)
## # A tibble: 15 x 5
##    Disease                             Year Count Population incidence_rate
##    <chr>                              <dbl> <dbl>      <dbl>          <dbl>
##  1 Amebiasis                           2015   309   39059809           0.79
##  2 Anaplasmosis and Ehrlichiosis       2015     4   39059809           0.01
##  3 Anthrax                             2015     0   39059809           0   
##  4 Babesiosis                          2015     7   39059809           0.02
##  5 Botulism, Foodborne                 2015     1   39059809           0   
##  6 Botulism, Other                     2015     2   39059809           0.01
##  7 Botulism, Wound                     2015    11   39059809           0.03
##  8 Brucellosis                         2015    30   39059809           0.08
##  9 Campylobacteriosis                  2015  8272   39059809          21.2 
## 10 Cholera                             2015     0   39059809           0   
## 11 Ciguatera Fish Poisoning            2015     1   39059809           0   
## 12 Coccidioidomycosis                  2015  3060   39059809           7.83
## 13 Creutzfeldt-Jakob Disease and oth~  2015    18   39059809           0.05
## 14 Cryptosporidiosis                   2015   375   39059809           0.96
## 15 Cyclosporiasis                      2015    15   39059809           0.04

Sort

ca_id_data_mutated <- ca_id_data_mutated %>% arrange(-incidence_rate)
head(ca_id_data_mutated, 15)
## # A tibble: 15 x 5
##    Disease                         Year Count Population incidence_rate
##    <chr>                          <dbl> <dbl>      <dbl>          <dbl>
##  1 Campylobacteriosis              2015  8272   39059809          21.2 
##  2 Salmonellosis                   2015  5568   39059809          14.3 
##  3 HIV                             2015  4947   39059809          12.7 
##  4 Coccidioidomycosis              2015  3060   39059809           7.83
##  5 Shigellosis                     2015  2227   39059809           5.7 
##  6 Giardiasis                      2015  2151   39059809           5.51
##  7 E. coli Other STEC (non-O157)   2015   602   39059809           1.54
##  8 Legionellosis                   2015   452   39059809           1.16
##  9 Cryptosporidiosis               2015   375   39059809           0.96
## 10 E. coli O157                    2015   330   39059809           0.84
## 11 Amebiasis                       2015   309   39059809           0.79
## 12 Vibrio Infection (non-Cholera)  2015   242   39059809           0.62
## 13 Dengue                          2015   135   39059809           0.35
## 14 Listeriosis                     2015   130   39059809           0.33
## 15 Yersiniosis                     2015   122   39059809           0.31

Do it all together! This is the strength of dplyr and the pipe

ca_id_data_cleaned <- ca_id_data %>%
                      filter(Sex == "Total", Year == 2015, County == "California" ) %>%
                      select(Disease,Year,Count,Population) %>%
                      mutate(incidence_rate = round(((Count/Population)*100000), digits = 2)) %>%
                      arrange(-incidence_rate)

Practice data management and using the pipe

Making static tables with Kable

Make a Simple Static Table of top 20 diseases using Kable

cols<-c(1:3,5) #This is creating an object to define which columns from our data we want to use
kable(ca_id_data_cleaned[1:20,cols]) #This is "old school indexing" in R which is largely replaced by subsetting in tidyverse, but is still really important to know
Disease Year Count incidence_rate
Campylobacteriosis 2015 8272 21.18
Salmonellosis 2015 5568 14.26
HIV 2015 4947 12.67
Coccidioidomycosis 2015 3060 7.83
Shigellosis 2015 2227 5.70
Giardiasis 2015 2151 5.51
E. coli Other STEC (non-O157) 2015 602 1.54
Legionellosis 2015 452 1.16
Cryptosporidiosis 2015 375 0.96
E. coli O157 2015 330 0.84
Amebiasis 2015 309 0.79
Vibrio Infection (non-Cholera) 2015 242 0.62
Dengue 2015 135 0.35
Listeriosis 2015 130 0.33
Yersiniosis 2015 122 0.31
Lyme Disease 2015 100 0.26
Malaria 2015 97 0.25
Shiga Toxin Positive Feces (without culture confirmation) 2015 98 0.25
Typhus Fever 2015 78 0.20
Staphylococcus aureus Infection (cases resulting in death or ICU) 2015 59 0.15

A Brief Sidenote on “old school indexing”" in R

##   x y
## 1 1 a
## 2 2 b
## 3 3 c
## [1] a b c
## Levels: a b c
## [1] 3

Make a Fancy Static Table of top 20 diseases using Kable

kable(ca_id_data_cleaned[1:20,cols],
      align = c('l','c','c','c'),
      col.names = c("Disease","Year","Number of Cases","Incidence Rate"),
      caption = "Top 20 Communicable Diseases  by Incidence Rate in California, 2015") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive","bordered"), full_width = T,position = "left", font_size = 12) %>%
  footnote(general = c("Data Retrieved on 5/16/19 from https://data.chhs.ca.gov/dataset/infectious-disease-cases-by-county-year-and-sex"))
Top 20 Communicable Diseases by Incidence Rate in California, 2015
Disease Year Number of Cases Incidence Rate
Campylobacteriosis 2015 8272 21.18
Salmonellosis 2015 5568 14.26
HIV 2015 4947 12.67
Coccidioidomycosis 2015 3060 7.83
Shigellosis 2015 2227 5.70
Giardiasis 2015 2151 5.51
E. coli Other STEC (non-O157) 2015 602 1.54
Legionellosis 2015 452 1.16
Cryptosporidiosis 2015 375 0.96
E. coli O157 2015 330 0.84
Amebiasis 2015 309 0.79
Vibrio Infection (non-Cholera) 2015 242 0.62
Dengue 2015 135 0.35
Listeriosis 2015 130 0.33
Yersiniosis 2015 122 0.31
Lyme Disease 2015 100 0.26
Malaria 2015 97 0.25
Shiga Toxin Positive Feces (without culture confirmation) 2015 98 0.25
Typhus Fever 2015 78 0.20
Staphylococcus aureus Infection (cases resulting in death or ICU) 2015 59 0.15
Note:
Data Retrieved on 5/16/19 from https://data.chhs.ca.gov/dataset/infectious-disease-cases-by-county-year-and-sex

Making interactive tables with DT

Make simple interactive Table of top 20 diseases using DT

datatable(ca_id_data_cleaned[1:20,cols])

Make a fancy interactive Table of top 20 diseases using DT

datatable(ca_id_data_cleaned[1:20,cols],
          options = list(columnDefs = list(list(className = 'dt-center', targets = c(1,2,3))),pageLength = 10), 
          class = 'compact stripe nowrap hover order-column row-border', 
          rownames = FALSE,
          colnames = c('Disease' = 1,'Year' = 2,'Number of Cases' = 3,'Incidence Rate' = 4),
          caption = htmltools::tags$caption(style = 'caption-side:top; text-align:left; color:red; font-weight:bold;font-size: 130%;','Top 20 Communicable Diseases  by Incidence Rate in California, 2015'))

Practice making tables

Grouping data using group_by()

ca_id_data_grouped <- ca_id_data %>%
                      filter(Sex == "Total", County == "California") %>%
                      group_by(Disease) 

Summarize

ca_id_data_summarized <- ca_id_data_grouped %>%
                         summarise(avg_count      = mean(Count), 
                                   total_count    = sum(Count), 
                                   avg_population = mean(Population), 
                                   avg_rate = mean(Rate)
                                   ) %>%
                         arrange(-avg_rate)       

head(ca_id_data_summarized,15)
## # A tibble: 15 x 5
##    Disease                    avg_count total_count avg_population avg_rate
##    <chr>                          <dbl>       <dbl>          <dbl>    <dbl>
##  1 Hepatitis C, Chronic          29369.      440540      36835584.   80.5  
##  2 Hepatitis B, Chronic          10540.      158104      36835584.   29.1  
##  3 Campylobacteriosis             6093        91395      36835584.   16.5  
##  4 HIV                            5533.       82996      36835584.   15.1  
##  5 Salmonellosis                  4626.       69396      36835584.   12.5  
##  6 Coccidioidomycosis             2896.       43438      36835584.    7.81 
##  7 Giardiasis                     2075.       31127      36835584.    5.66 
##  8 Shigellosis                    1639.       24581      36835584.    4.49 
##  9 Amebiasis                       376         5640      36835584.    1.03 
## 10 Cryptosporidiosis               311.        4662      36835584.    0.841
## 11 E. coli O157                    271.        4068      36835584.    0.735
## 12 E. coli Other STEC (non-O~      203.        2029      37595546.    0.530
## 13 Legionellosis                   175.        2618      36835584.    0.464
## 14 Malaria                         134.        2011      36835584.    0.367
## 15 Vibrio Infection (non-Cho~      135         2025      36835584.    0.362

Practice Grouping and Summarizing

Practice Answer

ca_id_data_group_sum_practice <- ca_id_data %>%
                        filter(Sex == "Total") %>%
                        group_by(County, Disease) %>%
                        summarise(avg_count = mean(Count), 
                                   avg_rate = mean(Rate)
                                   ) %>%
                        arrange(-avg_rate)       

head(ca_id_data_group_sum_practice,15)
## # A tibble: 15 x 4
## # Groups:   County [14]
##    County          Disease              avg_count avg_rate
##    <chr>           <chr>                    <dbl>    <dbl>
##  1 Lassen          Hepatitis C, Chronic     267.      772.
##  2 Del Norte       Hepatitis C, Chronic     179.      637.
##  3 Madera          Hepatitis C, Chronic     429.      303.
##  4 Kings           Hepatitis C, Chronic     437.      297.
##  5 Amador          Hepatitis C, Chronic     106.      283.
##  6 Kern            Hepatitis C, Chronic    1653.      208.
##  7 Tuolumne        Hepatitis C, Chronic     112.      201.
##  8 Imperial        Hepatitis C, Chronic     317.      193.
##  9 Kern            Coccidioidomycosis      1319.      164.
## 10 Humboldt        Hepatitis C, Chronic     209.      158.
## 11 Trinity         Hepatitis C, Chronic      20.8     154.
## 12 Lake            Hepatitis C, Chronic      92.1     146.
## 13 Shasta          Hepatitis C, Chronic     232.      133.
## 14 Mendocino       Hepatitis C, Chronic     116.      133.
## 15 San Luis Obispo Hepatitis C, Chronic     345.      131.

10 MINUTE BREAK

Now we are going to move on to some more advanced applications

Mapping

Packages for Mapping

#install.packages("sp")
#install.packages("tigris")

library(sp)
library(tigris)

options(tigris_class = "sf")   # options are sp (spatial polygons) or sf (simple features)
options(tigris_use_cache = F)  # set to true to save locally

Access census geographic boundary data

counties2010 <- counties(state = '06', cb = T, year = 2010)

Visualize the boundaires and data quickly using the R plot function

plot(counties2010)

Explore structure of data

str(counties2010)
## Classes 'sf' and 'data.frame':   58 obs. of  9 variables:
##  $ GEO_ID    : chr  "0500000US06035" "0500000US06049" "0500000US06075" "0500000US06083" ...
##  $ STATE     : chr  "06" "06" "06" "06" ...
##  $ COUNTY    : chr  "035" "049" "075" "083" ...
##  $ NAME      : chr  "Lassen" "Modoc" "San Francisco" "Santa Barbara" ...
##  $ LSAD      : chr  "County" "County" "County" "County" ...
##  $ CENSUSAREA: num  4541.2 3917.8 46.9 2735.1 953.2 ...
##  $ geometry  :sfc_MULTIPOLYGON of length 58; first list element: List of 1
##   ..$ :List of 1
##   .. ..$ : num [1:201, 1:2] -120 -120 -120 -120 -120 ...
##   ..- attr(*, "class")= chr  "XY" "MULTIPOLYGON" "sfg"
##  $ COUNTYFP  : chr  "035" "049" "075" "083" ...
##  $ STATEFP   : chr  "06" "06" "06" "06" ...
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA
##   ..- attr(*, "names")= chr  "GEO_ID" "STATE" "COUNTY" "NAME" ...
##  - attr(*, "tigris")= chr "county"

Explore the data

head(counties2010)

Mapping our Infectious Disease data

head(ca_id_data)
## # A tibble: 6 x 10
##   Disease County  Year Sex   Count Population  Rate `CI. lower` CI.upper
##   <chr>   <chr>  <dbl> <chr> <dbl>      <dbl> <dbl>       <dbl>    <dbl>
## 1 Amebia~ Solano  2005 Total     0     410570 0           0        0.898
## 2 Amebia~ Marin   2005 Male      4     121710 3.29        0.895    8.41 
## 3 Amebia~ Kern    2010 Fema~     1     404863 0.247       0.006    1.38 
## 4 Amebia~ Tulare  2001 Fema~     1     186935 0.535       0.014    2.98 
## 5 Amebia~ Los A~  2014 Fema~    17    5127242 0.332       0.193    0.531
## 6 Amebia~ Napa    2006 Total     2     131920 1.52        0.184    5.48 
## # ... with 1 more variable: Unstable <chr>

Preparing data for a merge

ca_id_data_formapping <- ca_id_data %>%
                         mutate(NAME = County) %>%
                         filter(Sex == "Total",Year == "2015", Disease == "Campylobacteriosis", NAME != "California")

Join data

counties2010_id<-left_join(counties2010, ca_id_data_formapping, by="NAME")
head(counties2010_id)
## Simple feature collection with 6 features and 18 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -123.0143 ymin: 33.46324 xmax: -119.027 ymax: 41.99761
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
##           GEO_ID STATE COUNTY          NAME   LSAD CENSUSAREA COUNTYFP
## 1 0500000US06035    06    035        Lassen County   4541.184      035
## 2 0500000US06049    06    049         Modoc County   3917.770      049
## 3 0500000US06075    06    075 San Francisco County     46.873      075
## 4 0500000US06083    06    083 Santa Barbara County   2735.085      083
## 5 0500000US06091    06    091        Sierra County    953.214      091
## 6 0500000US06095    06    095        Solano County    821.765      095
##   STATEFP            Disease        County Year   Sex Count Population
## 1      06 Campylobacteriosis        Lassen 2015 Total     3      30969
## 2      06 Campylobacteriosis         Modoc 2015 Total     1       9507
## 3      06 Campylobacteriosis San Francisco 2015 Total   518     863108
## 4      06 Campylobacteriosis Santa Barbara 2015 Total   113     444900
## 5      06 Campylobacteriosis        Sierra 2015 Total     1       3147
## 6      06 Campylobacteriosis        Solano 2015 Total   116     429267
##     Rate CI. lower CI.upper Unstable                       geometry
## 1  9.687     1.998   28.307        * MULTIPOLYGON (((-119.9987 4...
## 2 10.519     0.266   58.592        * MULTIPOLYGON (((-120.4882 4...
## 3 60.016    54.959   65.412     <NA> MULTIPOLYGON (((-123.0139 3...
## 4 25.399    20.933   30.536     <NA> MULTIPOLYGON (((-119.7898 3...
## 5 31.776     0.805  176.918        * MULTIPOLYGON (((-120.0031 3...
## 6 27.023    22.330   32.410     <NA> MULTIPOLYGON (((-122.2667 3...

Package tmap

#install.packages("tmap")
library(tmap)

qtm mapping with tmap

qtm(counties2010_id)

qtm with fills

qtm(counties2010_id,
    fill = "Rate")

Colors!

#install.packages("RColorBrewer")
library(RColorBrewer)
display.brewer.all()# show all ColorBrewer ramps

#webshot::install_phantomjs()
#install.packages("shiny")
#install.packages("shinyjs")
library(shiny)
library(shinyjs)
tmaptools::palette_explorer()

qtm with colors

qtm(counties2010_id, 
    fill = "Rate", 
    fill.palette = "Greens")

Can create a static map

tmap_mode("plot") # set tmap to static view mode
qtm(counties2010_id, 
    fill = "Rate", 
    fill.palette = "Greens") # Not Interactive

Can create an interactive map

tmap_mode("view") # set tmap to interactive view mode
qtm(counties2010_id, 
    fill = "Rate", 
    fill.palette = "Greens") # interactive

Can make a more complex map with customized popups

map1 <- tm_shape(counties2010_id) + 
  tm_polygons(col = "Rate",
              id="NAME",
              palette = "Greens",
              title = "Campylobacteriosis Rate 2015",
              popup.vars = c("Disease:" = "Disease", "Year:" = "Year", "Rate Per 100,000:" = "Rate"))
map1

Basemaps

Can change basemaps

-Here we will change the baselayer

map2 <- tm_basemap(leaflet::providers$Esri.WorldStreetMap)+
  tm_shape(counties2010_id) + 
  tm_polygons(col = "Rate",
              id="NAME",
              palette = "Greens",
              title = "Campylobacteriosis Rate 2015",
              popup.vars = c("Disease:" = "Disease", "Year:" = "Year", "Rate Per 100,000:" = "Rate"))
map2

Practice Changing Colors and Basemaps

Practice Answer

map3 <- tm_basemap(leaflet::providers$Stamen.Watercolor) +
  tm_shape(counties2010_id) + 
  tm_polygons(col = "Rate",
              id="NAME",
              palette = "Purples",
              title = "Campylobacteriosis Rate 2015",
              popup.vars = c("Disease:" = "Disease", "Year:" = "Year", "Rate Per 100,000:" = "Rate"))
map3

Can save the map as static image files/pdfs

tmap_save(map1, "California_Campylobacteriosis_Map_2015.pdf", height=10) # Static image file
tmap_save(map1, "California_Campylobacteriosis_Map_2015.png", height=6) # Static image file

Can save the map as interactive html

tmap_save(map2, "California_Campylobacteriosis_Map_2015.html") # interactive web map

Mapping with Leaflet!!

#install.packages("leaflet")
library(leaflet)

pal <- colorNumeric("Reds", counties2010_id$Rate)
lflt_map1 <- leaflet(counties2010_id) %>%
  addProviderTiles("Esri.WorldGrayCanvas") %>% 
  addPolygons(data=counties2010_id, 
              color = "white", 
              weight = 1, 
              opacity = 0.5,
              fillColor = ~pal(Rate),
              fillOpacity = 0.75, 
              popup = paste0("<b>County: </b>", counties2010_id$County,"<br>",
                             "<b>Disease: </b>", counties2010_id$Disease,"<br>",                             
                             "<b>Year: </b>", counties2010_id$Year,"<br>",
                             "<b>Rate: </b>", counties2010_id$Rate, " per 100,000")) %>%
  addLegend(position = "topright",
            pal = pal, 
            values = ~Rate,
            opacity = 0.75, 
            title="Rate per 100,000")

lflt_map1

Moving on to Plotting

Simple bar chart

top_plot <- ggplot((ca_id_data_cleaned[1:5,cols] %>% arrange(-incidence_rate)), aes(x = Disease, y = incidence_rate)) +
  geom_col() +
  ggtitle("Top 5 Diseases in California by Incidence Rate, 2015")
top_plot

Simple line plot

ca_id_data_campylobacteriosis_forplot <- ca_id_data %>%
    filter(Sex == "Total", County == "California", Disease == "Campylobacteriosis") %>%
    select(County, Sex, Year, Disease, Rate)

head(ca_id_data_campylobacteriosis_forplot, 5)
## # A tibble: 5 x 5
##   County     Sex    Year Disease             Rate
##   <chr>      <chr> <dbl> <chr>              <dbl>
## 1 California Total  2009 Campylobacteriosis  15.5
## 2 California Total  2005 Campylobacteriosis  12.9
## 3 California Total  2001 Campylobacteriosis  16.0
## 4 California Total  2014 Campylobacteriosis  19.8
## 5 California Total  2012 Campylobacteriosis  21.0

Campylobacteriosis line plot

campy_plot <- ggplot(ca_id_data_campylobacteriosis_forplot, aes(x = Year,  y = Rate)) +
  geom_line(aes(), size = 1) +
  ggtitle("California Campylobacteriosis Incidence Rate, 2001-2015")
campy_plot

Compare with other Diseases

ca_id_data_foodborne_forplot <- ca_id_data %>%
  filter(Sex == "Total", County == "California", Disease %in% c("Campylobacteriosis", "Salmonellosis", "Shigellosis", "Giardiasis", "Cryptosporidiosis", "E. coli O157"))%>%
  select(County, Sex, Year, Disease, Rate)

head(ca_id_data_foodborne_forplot, 5)
## # A tibble: 5 x 5
##   County     Sex    Year Disease             Rate
##   <chr>      <chr> <dbl> <chr>              <dbl>
## 1 California Total  2009 Campylobacteriosis  15.5
## 2 California Total  2005 Campylobacteriosis  12.9
## 3 California Total  2001 Campylobacteriosis  16.0
## 4 California Total  2014 Campylobacteriosis  19.8
## 5 California Total  2012 Campylobacteriosis  21.0

Foodborne Disease line plot

foodborne_plot <- ggplot(ca_id_data_foodborne_forplot, aes(x = Year, y = Rate)) +
  geom_line(aes(color = Disease), size = 1) +
  scale_x_continuous(breaks = seq(2001, 2015, 1)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  labs(x = "Year", y = "Incidence Rate (Per 100,000)") +
  ggtitle("California Foodborne Disease Incidence Rates, 2001-2015")
foodborne_plot

Interactive Foodborne Disease line plot!!

#install.packages("plotly")
library(plotly)
ggplotly(foodborne_plot)

Can save the plot as static image files/pdfs

ggsave(file = "Cal_Foodborne_2001-2015.pdf",
       plot = foodborne_plot, # or give ggplot object name as in myPlot,
       width = 5, 
       height = 5, 
       units = "in", # other options c("in", "cm", "mm"), 
       dpi = 300)

ggsave(file = "Cal_Foodborne_2001-2015.png",
       plot = foodborne_plot, # or give ggplot object name as in myPlot,
       width = 5, 
       height = 5, 
       units = "in", # other options c("in", "cm", "mm"), 
       dpi = 300)

We can also save the interactive plotly-wrapped plot in html

#install.package("htmlwidgets")
library(htmlwidgets)
plotly_foodborne_plot <- ggplotly(foodborne_plot)
htmlwidgets::saveWidget(as_widget(plotly_foodborne_plot), "Cal_Foodborne_2001-2015.html")

Dig in to the Viz!

campy_2015_plot <- ggplot(data = (ca_id_data %>%
                                    filter(Sex == "Total", Year == "2015", Disease == "Campylobacteriosis", County != "California")),
        aes(x = reorder(County,-Rate), y = Rate))+
        geom_col(aes())+
        theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))+
        labs(x = "County", y = "Incidence Rate (Per 100,000)")+
        ggtitle("California Campylobacteriosis Rate by County, 2015")
campy_2015_plot

ggplotly(campy_2015_plot)

There are many great resources for plotting in R.

  • This one is especailly valuable becuase of all the great theory and practice for making charts, and examples of R code for many many type of visualization https://www.data-to-viz.com/

Advanced Topic–Accessing Census API with TidyCensus

Shift to accessing Census Data

  • There are a few great packages for accessing Census API data.
  • The easiest one to use is the tidycensus package
  • You can install using install.packages(“tidycensus”)
#install.packages("tidycensus")
library(tidycensus)

# Load census key - Get a key at https://api.census.gov/data/key_signup.html and change it below!
my_census_api_key <- "39503dad5b1feef439deebd1a721c7a4aa4bee04"

# Make tidycensus aware of census key.  Also, To install your API key for use in future sessions, run this function with `install = TRUE`.
census_api_key(my_census_api_key) 

Set up the API pull

  • Now we have loaded our key we need to figure out which data we want to call
  • We need to specify geography and the variables we want
  • In this case, we want to get data at the County level, but it’s also available at other census geographies (i.e. tract, block groups, etc.): https://walkerke.github.io/tidycensus/articles/basic-usage.html
#Identify state(s) and county(s) or place(s) of interest.  YOu can find all available geographies and calls here:https://walkerke.github.io/tidycensus/articles/basic-usage.html
my_states <- c("06") # CA

#load the variables table to find what you want to look at/fetch
cenvar_table <-load_variables(year=2017, dataset = "acs5", cache=T)
B17001 <- filter(cenvar_table, str_detect(name, "B17001"))

#Take a look at the data frame to find the names of census tables. This will give you the table and variable codes. Use "View(cenvar_table)" and filter within the table to get the codes for a specific variable
View(B17001) 


#now you can identify a single variable or create a vector of variables
pop_total <- "B17001_001E" #can specify a single variable to pull
pop_poverty <- "B17001_002E" #can specify a single variable to pull

Pull Data from Census API using get_acs()

#Fetch Data of Interest.  Note some of the specifications here.  You specify the geometry.
## Please note: `get_acs()` now defaults to a year or endyear of 2017.

#here I pull tract level data for only the two vars we specified above (B17001_001E and B17001_002E) for Alameda County, CA 
county_poverty_acs5_2017 <- get_acs(geography = "county", 
                              variables = c(pop_total,pop_poverty), 
                              year=2017, survey="acs5",
                              state = my_states,
                              geometry = F)


#explore dataframe, it's long
head(county_poverty_acs5_2017, 3)
## # A tibble: 3 x 5
##   GEOID NAME                       variable   estimate   moe
##   <chr> <chr>                      <chr>         <dbl> <dbl>
## 1 06001 Alameda County, California B17001_001  1602357  1188
## 2 06001 Alameda County, California B17001_002   181194  4374
## 3 06003 Alpine County, California  B17001_001     1188   179

Process our ACS data to join it with our map!

  • Now we want a map of the % of the population that is in poverty by county
  • We need to prepare our data and create a new variable with the percentage that we want to map
# Select the columns of interest 
county_poverty_acs5_2017_wide <- county_poverty_acs5_2017 %>%
                                select("GEOID","variable","estimate") %>%
                                spread(key = variable, value = estimate)


# Rename columns
colnames(county_poverty_acs5_2017_wide) <- c("county", "pop_total", "pop_poverty")

county_poverty_acs5_2017_wide <- county_poverty_acs5_2017_wide %>%
                            mutate(pct_poverty = round((pop_poverty / pop_total) * 100, 1)) #create a variable with the % of the population that is under 18.

Create a Join Variable in our Map Data

  • Since we will want to join this with our county map data, we need to identify which variable to use to join. Unfortunately there isn’t a native variable that is common to both datasets
  • We need to use the substr() function to extract out the final 5 digits from the GEO_ID variable in our County level simple features dataframe
head(counties2010)
## Simple feature collection with 6 features and 8 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -123.0143 ymin: 33.46324 xmax: -119.027 ymax: 41.99761
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
##             GEO_ID STATE COUNTY          NAME   LSAD CENSUSAREA
## 187 0500000US06035    06    035        Lassen County   4541.184
## 188 0500000US06049    06    049         Modoc County   3917.770
## 189 0500000US06075    06    075 San Francisco County     46.873
## 190 0500000US06083    06    083 Santa Barbara County   2735.085
## 191 0500000US06091    06    091        Sierra County    953.214
## 192 0500000US06095    06    095        Solano County    821.765
##                           geometry COUNTYFP STATEFP
## 187 MULTIPOLYGON (((-119.9987 4...      035      06
## 188 MULTIPOLYGON (((-120.4882 4...      049      06
## 189 MULTIPOLYGON (((-123.0139 3...      075      06
## 190 MULTIPOLYGON (((-119.7898 3...      083      06
## 191 MULTIPOLYGON (((-120.0031 3...      091      06
## 192 MULTIPOLYGON (((-122.2667 3...      095      06
counties2010_acs <- counties2010 %>% 
  mutate(county = substr(GEO_ID,10,14))

head(counties2010_acs)
## Simple feature collection with 6 features and 9 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -123.0143 ymin: 33.46324 xmax: -119.027 ymax: 41.99761
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
##           GEO_ID STATE COUNTY          NAME   LSAD CENSUSAREA
## 1 0500000US06035    06    035        Lassen County   4541.184
## 2 0500000US06049    06    049         Modoc County   3917.770
## 3 0500000US06075    06    075 San Francisco County     46.873
## 4 0500000US06083    06    083 Santa Barbara County   2735.085
## 5 0500000US06091    06    091        Sierra County    953.214
## 6 0500000US06095    06    095        Solano County    821.765
##                         geometry COUNTYFP STATEFP county
## 1 MULTIPOLYGON (((-119.9987 4...      035      06  06035
## 2 MULTIPOLYGON (((-120.4882 4...      049      06  06049
## 3 MULTIPOLYGON (((-123.0139 3...      075      06  06075
## 4 MULTIPOLYGON (((-119.7898 3...      083      06  06083
## 5 MULTIPOLYGON (((-120.0031 3...      091      06  06091
## 6 MULTIPOLYGON (((-122.2667 3...      095      06  06095

Join ACS data with the SF mapping data

counties2010_acs_joined<-left_join(counties2010_acs, county_poverty_acs5_2017_wide, by="county")

Map it!

map_acs<-tm_shape(counties2010_acs_joined) + 
  tm_polygons(col = "pct_poverty",
              id="NAME",
              palette = "Reds",
              title = "2013-2017 Population Poverty Rate (%)",
              popup.vars = c("Percent of Population in Poverty" = "pct_poverty", "Total Population:" = "pop_total", "Population in Poverty" = "pop_poverty"))

map_acs

Advanced Topic–Shiny Apps

saveRDS(ca_id_data, file = "CCLHDME_R_shiny_app/data/ca_id_data.rds")

Other help resources

Notes on Workshop Development

That’s it for now…